The training dataset contains 12795 observations of 16 variables (one index, one response, and 14 predictor variables).
Each record (row) describes a wine being sold through a set of parameters, such as its chemical properties. The response variable TARGET, modelled here as continuous, represents the number of cases of wine sold as tasting samples to restaurants and wine stores around the United States.
The variables are: –INSERT PICTURE BELOW–
Summaries for the individual variables are provided below.
##      INDEX         TARGET       FixedAcidity    VolatileAcidity
##  Min.   :    1  Min.   :0.000  Min.   :-18.100  Min.   :-2.7900
##  1st Qu.: 4038  1st Qu.:2.000  1st Qu.:  5.200  1st Qu.: 0.1300
##  Median : 8110  Median :3.000  Median :  6.900  Median : 0.2800
##  Mean   : 8070  Mean   :3.029  Mean   :  7.076  Mean   : 0.3241
##  3rd Qu.:12106  3rd Qu.:4.000  3rd Qu.:  9.500  3rd Qu.: 0.6400
##  Max.   :16129  Max.   :8.000  Max.   : 34.400  Max.   : 3.6800
##
##    CitricAcid      ResidualSugar       Chlorides     FreeSulfurDioxide
##  Min.   :-3.2400  Min.   :-127.800  Min.   :-1.1710  Min.   :-555.00
##  1st Qu.: 0.0300  1st Qu.:  -2.000  1st Qu.:-0.0310  1st Qu.:   0.00
##  Median : 0.3100  Median :   3.900  Median : 0.0460  Median :  30.00
##  Mean   : 0.3084  Mean   :   5.419  Mean   : 0.0548  Mean   :  30.85
##  3rd Qu.: 0.5800  3rd Qu.:  15.900  3rd Qu.: 0.1530  3rd Qu.:  70.00
##  Max.   : 3.8600  Max.   : 141.150  Max.   : 1.3510  Max.   : 623.00
##                   NA's   :616       NA's   :638      NA's   :647
##  TotalSulfurDioxide     Density              pH           Sulphates
##  Min.   :-823.0      Min.   :0.8881  Min.   :0.480  Min.   :-3.1300
##  1st Qu.:  27.0      1st Qu.:0.9877  1st Qu.:2.960  1st Qu.: 0.2800
##  Median : 123.0      Median :0.9945  Median :3.200  Median : 0.5000
##  Mean   : 120.7      Mean   :0.9942  Mean   :3.208  Mean   : 0.5271
##  3rd Qu.: 208.0      3rd Qu.:1.0005  3rd Qu.:3.470  3rd Qu.: 0.8600
##  Max.   :1057.0      Max.   :1.0992  Max.   :6.130  Max.   : 4.2400
##  NA's   :682                         NA's   :395    NA's   :1210
##     Alcohol        LabelAppeal        AcidIndex        STARS
##  Min.   :-4.70  Min.   :-2.000000  Min.   : 4.000  Min.   :1.000
##  1st Qu.: 9.00  1st Qu.:-1.000000  1st Qu.: 7.000  1st Qu.:1.000
##  Median :10.40  Median : 0.000000  Median : 8.000  Median :2.000
##  Mean   :10.49  Mean   :-0.009066  Mean   : 7.773  Mean   :2.042
##  3rd Qu.:12.40  3rd Qu.: 1.000000  3rd Qu.: 8.000  3rd Qu.:3.000
##  Max.   :26.50  Max.   : 2.000000  Max.   :17.000  Max.   :4.000
##  NA's   :653                                       NA's   :3359
From the summaries and the chart above we can see that all variables are numeric and that several of them have missing data. The share of NAs is modest (under 10% of records) for every affected variable except STARS, which is missing in 3,359 of the 12,795 records (about 26%).
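The missing-value counts reported in the summaries can also be tabulated directly. The following is a minimal sketch on a small synthetic data frame; `wine_demo` and its injected NA rates are illustrative assumptions, not the real training data.

```r
# Synthetic stand-in for the training data (NOT the real wine dataset).
set.seed(1)
n <- 200
wine_demo <- data.frame(
  TARGET  = rpois(n, 3),
  Alcohol = rnorm(n, 10.5, 3.7),
  STARS   = sample(1:4, n, replace = TRUE)
)

# Inject missing values at different rates (assumed rates, for illustration).
wine_demo$Alcohol[sample(n, 10)] <- NA
wine_demo$STARS[sample(n, 50)]   <- NA

# Count and percentage of NAs per column, highest first.
na_count <- colSums(is.na(wine_demo))
na_table <- data.frame(
  variable    = names(na_count),
  n_missing   = as.integer(na_count),
  pct_missing = round(100 * as.numeric(na_count) / nrow(wine_demo), 1)
)
na_table <- na_table[order(-na_table$n_missing), ]
print(na_table)
```

Applied to the real training data, the same two steps (`colSums(is.na(...))` followed by a division by `nrow(...)`) would put STARS at the top of the table.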
A near-zero-variance check did not flag any variable.
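The report does not state which tool performed this check. As a hedged illustration, the base R sketch below flags a column when its most common value dominates the second most common one and the column has few distinct values. The helper `near_zero_var()` and its thresholds are assumptions, loosely modelled on the defaults of `caret::nearZeroVar` (`freqCut = 95/5`, `uniqueCut = 10`), and the data are synthetic.

```r
# Hypothetical helper: flags near-zero-variance columns using two heuristics.
#   freq_ratio: count of the most common value / count of the second most common
#   pct_unique: number of distinct values as a percentage of non-missing rows
near_zero_var <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  col_stats <- lapply(df, function(x) {
    x <- x[!is.na(x)]
    counts <- sort(table(x), decreasing = TRUE)
    freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
    c(freq_ratio = freq_ratio,
      pct_unique = 100 * length(counts) / length(x),
      n_distinct = length(counts))
  })
  out <- as.data.frame(do.call(rbind, col_stats))
  # Flag dominated, low-cardinality columns, plus columns with at most one value.
  out$nzv <- (out$freq_ratio > freq_cut & out$pct_unique <= unique_cut) |
    out$n_distinct <= 1
  out
}

# Synthetic example: only `almost_constant` should be flagged.
set.seed(2)
nzv_demo <- data.frame(
  varied          = rnorm(500),
  almost_constant = c(rep(0, 495), rep(1, 5)),
  STARS           = sample(1:4, 500, replace = TRUE)
)
nzv_result <- near_zero_var(nzv_demo)
print(nzv_result)
```

A result with no `TRUE` in the `nzv` column corresponds to the outcome reported above.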
Per-variable distribution analysis is provided below (excluding the INDEX variable, which is immaterial to the analysis and will not be considered further).
The pairwise correlations between the continuous variables are displayed below.
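Because several variables contain NAs, a correlation matrix needs an explicit rule for missing values. The sketch below uses `cor()` with `use = "pairwise.complete.obs"`, which computes each coefficient from the rows where both variables are observed. Whether the report used this option is an assumption, and `corr_demo` is synthetic data.

```r
# Synthetic data with a built-in positive link between STARS and TARGET.
set.seed(3)
n <- 300
corr_demo <- data.frame(
  STARS       = sample(1:4, n, replace = TRUE),
  LabelAppeal = sample(-2:2, n, replace = TRUE),
  Alcohol     = rnorm(n, 10.5, 3.7)
)
corr_demo$TARGET <- round(pmax(
  0,
  1 + 0.9 * corr_demo$STARS + 0.4 * corr_demo$LabelAppeal + rnorm(n)
))

# Remove a quarter of the STARS values to mimic its high NA share.
corr_demo$STARS[sample(n, 75)] <- NA

# Each pair of columns uses only the rows where both values are present.
corr_mat <- cor(corr_demo, use = "pairwise.complete.obs")
print(round(corr_mat, 2))
```

With pairwise deletion, different cells of the matrix may be based on different numbers of rows, which is worth keeping in mind when comparing coefficients involving STARS with the rest.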
Model summary
From the model summary we can see the following:
Interpretation of the regression coefficients
The diagnostic plots for the model can be generated using the R code provided in the appendix.
For the second model, the following changes are made:
Model summary
The interpretation of the coefficients remains the same as in the full model.
The performance of the continuous models is compared using RMSE on out-of-sample data.
The RMSE for the first (full) model is lower. The charts below show that model two consistently underpredicts the true values.
The initial full model is therefore selected for now to produce predictions on the evaluation data, although further tuning could improve the accuracy of the predictions.
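For reference, the RMSE over $n$ held-out observations is $\sqrt{\frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2}$. The sketch below illustrates the comparison on synthetic data with an assumed 80/20 split; `fit_full` and `fit_reduced` are hypothetical stand-ins and do not reproduce the two models of this report.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# Synthetic data: TARGET depends on all three predictors.
set.seed(4)
n <- 500
sim <- data.frame(
  STARS       = sample(1:4, n, replace = TRUE),
  LabelAppeal = sample(-2:2, n, replace = TRUE),
  AcidIndex   = sample(5:11, n, replace = TRUE)
)
sim$TARGET <- 1 + 0.8 * sim$STARS + 0.5 * sim$LabelAppeal -
  0.1 * sim$AcidIndex + rnorm(n)

# Hold out 20% of the rows for out-of-sample evaluation (assumed split).
train_idx <- sample(n, size = 0.8 * n)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit_full    <- lm(TARGET ~ STARS + LabelAppeal + AcidIndex, data = train)
fit_reduced <- lm(TARGET ~ AcidIndex, data = train)

rmse_full    <- rmse(test$TARGET, predict(fit_full, newdata = test))
rmse_reduced <- rmse(test$TARGET, predict(fit_reduced, newdata = test))
print(c(full = rmse_full, reduced = rmse_reduced))
```

The model with the lower held-out RMSE is preferred. A systematic offset between predictions and true values can additionally be checked with `mean(predicted - actual)`.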
Predictions on the evaluation dataset
Predictions on the evaluation dataset are made using the model m1_cont.
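A minimal sketch of this scoring step is given below, assuming a fitted `lm` object and an evaluation data frame with the same predictor columns. The names `m_demo`, `eval_demo` and `TARGET_pred` are hypothetical, the data are synthetic, and the output goes to a temporary file rather than to the location used in the report.

```r
# Fit a small stand-in model on synthetic training data.
set.seed(5)
n <- 200
train_demo <- data.frame(
  STARS       = sample(1:4, n, replace = TRUE),
  LabelAppeal = sample(-2:2, n, replace = TRUE)
)
train_demo$TARGET <- 1 + 0.8 * train_demo$STARS +
  0.5 * train_demo$LabelAppeal + rnorm(n)
m_demo <- lm(TARGET ~ STARS + LabelAppeal, data = train_demo)

# Hypothetical evaluation records; the third one has a missing predictor.
eval_demo <- data.frame(
  INDEX       = 1:5,
  STARS       = c(1, 2, NA, 4, 3),
  LabelAppeal = c(0, -1, 1, 2, 0)
)

# predict() returns NA for rows with missing predictors by default.
eval_demo$TARGET_pred <- predict(m_demo, newdata = eval_demo)

# Write the index and prediction columns to a CSV file.
out_file <- tempfile(fileext = ".csv")
write.csv(eval_demo[, c("INDEX", "TARGET_pred")], out_file, row.names = FALSE)
print(eval_demo)
```

Rows with missing predictors receive no prediction in this sketch, so the evaluation data would need the same treatment of NAs as the training data before scoring.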
The model's predictions on the evaluation data are available at the following URL: